REPRESENTATION LEARNING FOR ACTION RECOGNITION
The objective of this research work is to develop discriminative representations for human
actions. The motivation stems from the fact that there are many issues encountered while
capturing actions in videos like intra-action variations (due to actors, viewpoints, and duration),
inter-action similarity, background motion, and occlusion of actors. Hence, obtaining
a representation which can address all the variations in the same action while maintaining
discrimination with other actions is a challenging task. In the literature, actions have been represented
using either low-level or high-level features. Low-level features describe
the motion and appearance in small spatio-temporal volumes extracted from a video. Due
to the limited space-time volume used for extracting low-level features, they are not able
to account for viewpoint and actor variations or variable length actions. On the other hand,
high-level features handle variations in actors, viewpoints, and duration but the resulting
representation is often high-dimensional which introduces the curse of dimensionality. In
this thesis, we propose new representations for describing actions by combining the advantages
of both low-level and high-level features. Specifically, we investigate various linear
and non-linear decomposition techniques to extract meaningful attributes in both high-level
and low-level features. In the first approach, the sparsity of high-level feature descriptors is leveraged to build
action-specific dictionaries. Each dictionary retains only the discriminative information
for a particular action and hence reduces inter-action similarity. Then, a sparsity-based
classification method is proposed to classify the low-rank representation of clips obtained
using these dictionaries. We show that this representation based on dictionary learning improves
the classification performance across actions. Moreover, some actions involve
rapid body deformations that hinder the extraction of local features from body movements.
Hence, we propose to use a dictionary which is trained on convolutional neural network
(CNN) features of the human body in various poses to reliably identify actors from the
background. Particularly, we demonstrate the efficacy of sparse representation in the identification
of the human body under rapid and substantial deformation.
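The dictionary-based classification idea above can be sketched in a few lines. The following is a minimal illustration, with least-squares coding standing in for a full sparse solver (e.g. OMP or LASSO) and random matrices standing in for the thesis's learned action-specific dictionaries and clip features:

```python
import numpy as np

def classify_by_residual(x, dictionaries):
    """Assign x to the class whose dictionary reconstructs it best.

    Each dictionary is a (feature_dim, n_atoms) matrix of learned atoms;
    least-squares coding is a stand-in for a sparse solver (OMP/LASSO).
    """
    residuals = []
    for D in dictionaries:
        alpha, *_ = np.linalg.lstsq(D, x, rcond=None)    # code x over D's atoms
        residuals.append(np.linalg.norm(x - D @ alpha))  # reconstruction error
    return int(np.argmin(residuals))

rng = np.random.default_rng(0)
# Two toy class dictionaries spanning different subspaces.
D0 = rng.normal(size=(32, 8))
D1 = rng.normal(size=(32, 8))
x = D1 @ rng.normal(size=8)               # a clip lying in class 1's span
print(classify_by_residual(x, [D0, D1]))  # → 1
```

Because a vector in one dictionary's span reconstructs with near-zero error there but not under the other class's random subspace, the minimum-residual rule recovers the correct class.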
In the first two approaches, sparsity-based representations are developed to improve discriminability
using class-specific dictionaries that utilize action labels. However, developing
an unsupervised representation of actions is more beneficial as it can be used both to
recognize similar actions and to localize them. We propose to exploit inter-action similarity
to train a universal attribute model (UAM) in order to learn action attributes (common and
distinct) implicitly across all the actions. Using maximum a posteriori (MAP) adaptation,
a high-dimensional super action-vector (SAV) is extracted for each clip. As this SAV contains
redundant attributes of all other actions, we use factor analysis to extract a novel low-dimensional
action-vector representation for each clip. Action-vectors are shown to suppress
background motion and highlight actions of interest in both trimmed and untrimmed
clips, which contributes to action recognition without the help of any classifiers.
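The MAP adaptation step can be illustrated with the standard relevance-MAP mean update for a single mixture component of such an attribute model; the relevance factor and the toy statistics below are assumptions for illustration, not values from the thesis:

```python
import numpy as np

def map_adapt_mean(prior_mean, frames, posteriors, r=16.0):
    """Relevance-MAP update of one mixture-component mean.

    frames: (n, d) clip features; posteriors: (n,) component
    responsibilities. High occupancy n_k pulls the mean toward the
    clip's statistics; low occupancy keeps it near the prior.
    """
    n_k = posteriors.sum()            # soft frame count for this component
    f_k = posteriors @ frames         # first-order sufficient statistics
    alpha = n_k / (n_k + r)           # data-vs-prior adaptation weight
    return alpha * (f_k / n_k) + (1.0 - alpha) * prior_mean

prior = np.zeros(3)
frames = np.full((50, 3), 2.0)        # all frames at (2, 2, 2)
post = np.ones(50)                    # fully assigned to this component
adapted = map_adapt_mean(prior, frames, post)
print(adapted)  # ≈ [1.52, 1.52, 1.52]: between prior (0) and data mean (2)
```

Stacking the adapted means of all components then yields the high-dimensional super-vector that the factor analysis subsequently compresses.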
It is observed in our experiments that action-vectors cannot effectively discriminate
between actions which are visually similar to each other. Hence, we subject action-vectors
to supervised linear embedding using linear discriminant analysis (LDA) and probabilistic
LDA (PLDA) to enforce discrimination. In particular, we show that leveraging complementary
information across action-vectors built from different local features, followed by discriminative
embedding provides the best classification performance. Further, we explore
non-linear embedding of action-vectors using Siamese networks especially for fine-grained
action recognition. A visualization of the hidden layer output in Siamese networks shows
its ability to effectively separate visually similar actions. This leads to better classification
performance than linear embedding on fine-grained action recognition.
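As a rough illustration of the supervised linear embedding step, the two-class Fisher LDA direction can be computed directly from class scatter; the toy Gaussian blobs below stand in for action-vectors of two visually similar actions:

```python
import numpy as np

def fisher_lda_direction(X0, X1):
    """Two-class Fisher discriminant direction: w = Sw^{-1} (m1 - m0)."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    S0 = (X0 - m0).T @ (X0 - m0)
    S1 = (X1 - m1).T @ (X1 - m1)
    Sw = S0 + S1 + 1e-6 * np.eye(X0.shape[1])  # within-class scatter (regularized)
    w = np.linalg.solve(Sw, m1 - m0)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(1)
X0 = rng.normal(loc=0.0, size=(100, 4))   # "action A" vectors
X1 = rng.normal(loc=1.0, size=(100, 4))   # visually similar "action B"
w = fisher_lda_direction(X0, X1)
# Projections onto w pull the two classes apart along a single axis.
print((X0 @ w).mean() < (X1 @ w).mean())  # → True
```

Projecting action-vectors onto the top discriminant directions in this manner is what enforces separation between visually similar actions; PLDA and the Siamese embedding play the same role with probabilistic and non-linear machinery, respectively.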
All of the above approaches are presented on large unconstrained datasets with hundreds
of examples per action. However, actions in surveillance videos like snatch thefts are
difficult to model because of the wide variety of scenarios in which they occur and the
scarcity of labeled examples. Hence, we propose to utilize the universal attribute model (UAM)
trained on large action datasets to represent such actions. Specifically, we show that
certain actions in the large datasets share similarities with snatch thefts, which helps
in extracting a representation for snatch thefts using the attributes from the UAM. This
representation is shown to be effective in distinguishing snatch thefts from regular actions
with high accuracy.

In summary, this thesis proposes both supervised and unsupervised approaches for representing
actions which provide better discrimination than existing representations. The
first approach presents a dictionary learning based sparse representation for effective discrimination
of actions. Also, we propose a sparse representation for the human body based
on dictionaries in order to recognize actions with rapid body deformations. In the next
approach, a low-dimensional representation called action-vector for unsupervised action
recognition is presented. Further, linear and non-linear embedding of action-vectors is
proposed for addressing inter-action similarity and fine-grained action recognition, respectively.
Finally, we propose a representation for locating snatch thefts among thousands of
regular interactions in surveillance videos.
Medical images modality classification using multi-scale dictionary learning
In this paper, we propose a method for classification of medical images captured by different sensors (modalities) based on multi-scale wavelet representation using dictionary learning. Wavelet features extracted from an image provide discrimination useful for classification of medical images, namely, diffusion tensor imaging (DTI), magnetic resonance imaging (MRI), magnetic resonance angiography (MRA) and functional magnetic resonance imaging (fMRI). The ability of on-line dictionary learning (ODL) to achieve a sparse representation of an image is exploited to develop dictionaries for each class using multi-scale (wavelet) features. An experimental analysis performed on a set of images from the ICBM medical database demonstrates the efficacy of the proposed method.
View and Illumination Invariant Object Classification Based on 3D Color Histogram Using Convolutional Neural Networks
Object classification is an important step in visual recognition and semantic analysis of visual content. In this paper, we propose a method for classification of objects that is invariant to illumination color, illumination direction and viewpoint, based on the 3D color histogram. A 3D color histogram of an image is represented as a 2D image to capture the color composition while preserving the neighborhood information of color bins, realizing the necessary visual cues for classification of objects. Also, the ability of a convolutional neural network (CNN) to learn invariant visual patterns is exploited for object classification. The efficacy of the proposed method is demonstrated on the Amsterdam Library of Object Images (ALOI) dataset, captured under various illumination conditions and angles of view.
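One simple way to realize the "3D histogram as 2D image" arrangement described above is to keep two color axes as the image axes and tile the slices of the third axis side by side; the 8-bin quantization and tiling layout below are assumptions for illustration, not the paper's exact construction:

```python
import numpy as np

def histogram_3d_as_2d(rgb, bins=8):
    """Quantize RGB into bins^3 cells, then unfold the 3D histogram
    into a 2D (bins, bins*bins) image: R indexes rows, and the B
    slices are tiled side by side along the columns, so neighboring
    color bins stay neighbors in the 2D layout."""
    q = np.clip((rgb.reshape(-1, 3).astype(int) * bins) // 256, 0, bins - 1)
    hist = np.zeros((bins, bins, bins))
    for r, g, b in q:
        hist[r, g, b] += 1
    return np.concatenate([hist[:, :, b] for b in range(bins)], axis=1)

img = np.zeros((4, 4, 3), dtype=np.uint8)   # a tiny all-black image
h2d = histogram_3d_as_2d(img)
print(h2d.shape)   # → (8, 64)
print(h2d[0, 0])   # → 16.0: all 16 pixels fall in the (0, 0, 0) bin
```

The resulting 2D array can then be fed to a CNN like any single-channel image, which is what lets the network exploit the preserved neighborhood structure of the color bins.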
Learning sparse dictionaries for music and speech classification
The field of music and speech classification is quite mature, with researchers having largely settled on the best discriminative representations. In this regard, Zubair et al. showed that sparse coefficients used along with an SVM can classify audio signals as music or speech with near-perfect accuracy. In the proposed method, we go one step further: instead of feeding the sparse coefficients to a separate classifier, they are used directly with a dictionary learned via on-line dictionary learning for music-speech classification. This approach not only removes the redundancy of a separate classifier but also produces complete discrimination of music and speech on the GTZAN music/speech dataset. Moreover, instead of a high-dimensional feature vector space, which inherently leads to high computation time and complicated decision-boundary calculation on the part of the SVM, the restricted dictionary size with limited computation serves the same purpose.
Music genre classification using On-line Dictionary Learning
In this paper, an approach for music genre classification based on sparse representation using MARSYAS features is proposed. The MARSYAS feature descriptor consisting of timbral texture, pitch and beat related features is used for the classification of music genre. On-line Dictionary Learning (ODL) is used to achieve sparse representation of the features for developing dictionaries for each musical genre. We demonstrate the efficacy of the proposed framework on the Latin Music Database (LMD) consisting of over 3000 tracks spanning 10 genres namely Axé, Bachata, Bolero, Forró, Gaúcha, Merengue, Pagode, Salsa, Sertaneja and Tango
Dictionary based action video classification with action bank
Classifying action videos has become a challenging problem in the computer vision community. In this work, action videos are represented by dictionaries which are learned by on-line dictionary learning (ODL). Here, we use two simple measures to classify action videos: reconstruction error and projection. The sparse approximation algorithm LASSO is used to reconstruct a test video, and the reconstruction error is calculated for each of the dictionaries. To obtain another discriminative measure, projection, the test vector is projected onto the atoms in each dictionary. The minimum reconstruction error and the maximum projection indicate the action category of the test vector. With action bank as the feature vector, our best performance is 59.3% on UCF50 (benchmark: 57.9%), 97.7% on KTH (benchmark: 98.2%) and 23.63% on HMDB51 (benchmark: 26.9%).
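The projection measure described above can be sketched in a few lines; the unit-norm random atoms below are placeholders for learned ODL atoms, not the paper's actual dictionaries:

```python
import numpy as np

def projection_score(x, D):
    """Maximum absolute correlation between a test vector and the
    (unit-normalized) atoms of one class dictionary; a higher score
    means the dictionary's atoms align better with x."""
    D = D / np.linalg.norm(D, axis=0, keepdims=True)  # normalize atoms
    return np.max(np.abs(D.T @ x))

rng = np.random.default_rng(2)
atoms = rng.normal(size=(16, 5))
x = 3.0 * atoms[:, 2]               # test vector aligned with atom 2
score = projection_score(x, atoms)
# The aligned atom attains |cosine| = 1 scaled by ||x||, the maximum possible.
print(np.isclose(score, np.linalg.norm(x)))  # → True
```

Computing this score per class dictionary and taking the argmax complements the minimum-reconstruction-error rule: the two measures agree when a test vector is well explained by a single class's atoms.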
Recent Advances in mmWave-Radar-Based Sensing, Its Applications, and Machine Learning Techniques: A Review
Human gesture detection, obstacle detection, collision avoidance, parking aids, automotive driving, medical, meteorological, industrial, agriculture, defense, space, and other relevant fields have all benefited from recent advancements in mmWave radar sensor technology. A mmWave radar has several advantages that set it apart from other types of sensors. A mmWave radar can operate in bright, dazzling, or no-light conditions. A mmWave radar has better antenna miniaturization than other traditional radars, and it has better range resolution. However, as more data sets have been made available, there has been a significant increase in the potential for incorporating radar data into different machine learning methods for various applications. This review focuses on key performance metrics in mmWave-radar-based sensing, detailed applications, and machine learning techniques used with mmWave radar for a variety of tasks. This article starts out with a discussion of the various working bands of mmWave radars, then moves on to various types of mmWave radars and their key specifications, mmWave radar data interpretation, vast applications in various domains, and, in the end, a discussion of machine learning algorithms applied with radar data for various applications. Our review serves as a practical reference for beginners developing mmWave-radar-based applications by utilizing machine learning techniques.
Defining Traffic States using Spatio-temporal Traffic Graphs
Intersections are one of the main sources of congestion and hence, it is
important to understand traffic behavior at intersections. Particularly, in
developing countries with high vehicle density, mixed traffic type, and
lane-less driving behavior, it is difficult to distinguish between congested
and normal traffic behavior. In this work, we propose a way to understand the
traffic state of smaller spatial regions at intersections using traffic graphs.
The way these traffic graphs evolve over time reveals different traffic states:
a) congestion is forming (clumping), b) congestion is dispersing
(unclumping), or c) traffic is flowing normally (neutral). We train a
spatio-temporal deep network to identify these changes. Also, we introduce a
large dataset called EyeonTraffic (EoT) containing 3 hours of aerial videos
collected at 3 busy intersections in Ahmedabad, India. Our experiments on the
EoT dataset show that the traffic graphs can help in correctly identifying
congestion-prone behavior in different spatial regions of an intersection.Comment: Accepted in 23rd IEEE International Conference on Intelligent
Transportation Systems September 20 to 23, 2020. 6 pages, 6 figure
Sparseland model for speckle suppression of B-mode ultrasound images
Speckle is a multiplicative noise which is inherent in medical ultrasound images. Speckle contributes high variance between neighboring pixels, reducing the visual quality of an image. Suppression of speckle noise significantly improves the diagnostic content present in the image. In this paper, we propose how the sparseland model can be used for speckle suppression. The performance of the model is evaluated based on the variance-to-mean ratio of a patch in the filtered image. The algorithm is tested on both software-generated images and real ultrasound images. The proposed algorithm performs similarly to past adaptive speckle suppression filters and seems promising in improving diagnostic content.